Bayesian Search for Debugging: An Evaluation of BBChop

## Abstract

Introduction

The purpose of the experiment was to see whether Bayesian Search could be used for debugging. Of particular interest to me is whether Bayesian Search remains suitable for low fault probabilities or a large number of potential fault locations.

Underlying this method is Bayesian Search theory.

Git Bisect is a valuable tool for many large projects [Citation Needed]. However, it falls short for non-deterministic bugs. A tool like BBChop could therefore be valuable.

Method

  • I created a wrapper that runs a BBChop search on a "test" which passes if the tested location is less than "L" (named fail_loc in the data) and which fails with probability fail_prob if the tested location is greater than or equal to fail_loc. A sketch of such a test follows after this list.
  • To make the experiment reproducible, I seeded Python's pseudo-random number generator and recorded the seed as an input to each experiment.
  • The result of each experiment is the number of trials BBChop needed to come to a solution, and the location where the fault is believed to be (where).
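The test wrapper can be pictured with the following sketch (my own illustration; make_test and its parameter names are hypothetical and not BBChop's API):

```python
import random

def make_test(fail_loc, fail_prob, seed):
    # Flaky test: always passes before fail_loc, fails with probability
    # fail_prob at or after fail_loc.
    rng = random.Random(seed)

    def test(location):
        if location < fail_loc:
            return True                      # fault not yet introduced: pass
        return rng.random() >= fail_prob     # fault present: intermittent failure

    return test
```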

Infrastructure

  • Python 3.4.3 was used. The code was run in a virtualenv with mpmath installed as an additional package. The visualisation and analysis need the pandas and seaborn packages.
  • BBChop version 2d889bf was used. Additionally, for more values of N, version 17b502b was used; between these two versions the core implementation did not change. For the linear (non-logarithmic) implementation, version b32809f was used. The difference between the logarithmic and linear implementations is explained in Appendix A.
  • One may replicate the experiment by executing analysis.py with a list of hostnames as input. The hardcoded configurations are distributed among the hostnames, with each hostname receiving one set of configurations; the result is written to a <hostname>.json file. Running compute.py with one of these JSON files as input runs the experiment on each configuration in the input file.
  • I distributed the computation to ten machines and the results came back within a few minutes.

Results

Discussion

Appendices

Further Reading

  • [GitBisectExplained]
  • BBChop background 1, background 2.
    • A Bayesian Binary Search algorithm for locating non-deterministic faults, Ealdwulf Wuffinga
    • Bayesian Search and Debugging, Ealdwulf Wuffinga
  • Stochastic Root Finding

    We can treat the problem as one of Stochastic Root Finding. If we model the test as a function of location returning -1 for a failure and +1 for a success, then we are searching for the root of this function, which is the first faulty location.
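As a small illustration of this framing (my own sketch, reusing the fail_loc and fail_prob parameters of the experiment):

```python
import random

# The test as a function of location returning +1 (pass) or -1 (failure).
# For a deterministic fault it flips sign exactly at the first faulty
# location; with an intermittent fault the -1 only shows up with
# probability fail_prob, turning the search into a stochastic problem.
def noisy_sign(location, fail_loc, fail_prob, rng=random):
    if location < fail_loc:
        return +1
    return -1 if rng.random() < fail_prob else +1
```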

Motivation

Evolving code bases sometimes suffer from regressions: functionality that once worked but no longer does.

Searching for regressions in a code base can very well be done with Git Bisect 2. As a prerequisite, one needs a test that determines whether the fault is present. Also, the latest version must be faulty and some earlier version must be known to be non-faulty.

Then a binary search can be done over the history. This is extremely fast, because it requires only log(N) iterations. However, the prerequisite was that we can test for the presence of the fault.

However, not all faults are deterministic. Some faults are 'intermittent' or 'stochastic'. In this analysis, I evaluate code for finding the location of a fault despite the fault's stochastic behavior.

BBChop

In 2009, "Ealdwulf Wuffinga" announced 1 on the git mailing list, that he had implemented a more general algorithm to finding bugs. He called it BBChop, for Bayesian Binary Chop. It is based on Bayesian Search Theory.

BBChop is like git bisect (or equivalent), but works when your bug is intermittent. That is, it works in the presence of false negatives (when a version happens to work this time even though it contains the bug). 3

Possible Examples

  • Stochastic Algorithms.
    • Particle Filter
    • MCMC
    • RANSAC
    • Probabilistic Roadmaps
    • Machine Configurations

Analysis

I ran experimental runs of BBChop with varying parameters and recorded how many trials were necessary to identify the fault.

Results


In [1]:
%matplotlib inline

import json
import pandas as pd
import seaborn as sn

In [2]:
with open('results.json') as f: d = json.load(f)

In [3]:
df = pd.DataFrame.from_dict(d)
df.head()


Out[3]:
|   | N   | certainty | fail_loc | fail_prob | seed | success | trials | where |
|---|-----|-----------|----------|-----------|------|---------|--------|-------|
| 0 | 900 | 0.7       | 630      | 1.000     | 5    | True    | 16     | 630   |
| 1 | 10  | 0.9       | 7        | 0.005     | 6    | False   | 14     | 9     |
| 2 | 300 | 0.7       | 210      | 1.000     | 1    | True    | 13     | 210   |
| 3 | 10  | 0.9       | 5        | 0.250     | 6    | True    | 64     | 5     |
| 4 | 500 | 0.7       | 250      | 1.000     | 7    | True    | 14     | 250   |

You can see that there are the following columns:

  • N: Size of the range to search. For example, the number of commits in a code base. I used 10, 100, and 1000.
  • certainty: How certain to be before terminating the search. I used two values here: 0.9 and 0.7 (90% and 70%, respectively).
  • fail_loc: The location where the first failure occurs. I used 0.7 * N and 0.5 * N.
  • fail_prob: The rate of failure if the fault is present: at or after fail_loc the test fails with probability fail_prob, before fail_loc it never fails.
  • seed: The pseudo-random number generator's seed. I used this so that results would be reproducible.
  • success: Whether the found location (where) equals fail_loc.
  • trials: How many trials were needed to come to a solution.
  • where: Where the fault was thought to be. If that location is the true faulty location, then success == True.

In [22]:
sn.lmplot(x='fail_prob', y='trials', data=df.query('N in [10, 100, 1000]'), hue='N', fit_reg=False)


Out[22]:
<seaborn.axisgrid.FacetGrid at 0x1096d3cc0>

In [5]:
sn.distplot(df['seed'])


Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x106968f98>

The distribution of seeds is not very helpful; the seeds simply range from 1 to 9.

Trials vs probability is not very informative, not even when I separate by 'N', the size of my history.

But I suspect that there is a bug for small probabilities. Let's filter those out.


In [23]:
df_large_prob = df[df.fail_prob > 0.1]
sn.lmplot(x='fail_prob', y='trials', data=df_large_prob.query('N in [10, 100, 1000]'), hue='N')


Out[23]:
<seaborn.axisgrid.FacetGrid at 0x109829358>

That is more like it.

  1. The higher 'N', the more trials are needed. This is what I would expect.
  2. The lower the failure probability, the more trials are needed. Again, that sounds very reasonable.

Next, let's do a violin plot to see the distribution at each probability.


In [21]:
sn.violinplot(x='fail_prob', y='trials', hue='N',
                     data=df_large_prob.query('N in [10, 100, 1000]'))


Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x1093d6dd8>

Can you see anything? I can't.

But I think I'd like to see the same for only N = 1000, with the violin plot split on success, because honestly I don't care much about the success == False cases.


In [8]:
df_N1000 = df.query('N == 1000')
sn.violinplot(x='fail_prob', y='trials', hue='success', data=df_N1000, split=True)


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x108212d30>

Very nice! First observation: probabilities below 0.1 are different. Just to be safe, let's do a log plot. Second observation: at low probabilities, success or failure makes all the difference to the number of trials.


In [9]:
import matplotlib.pyplot as plt
f, ax = plt.subplots(figsize=(7, 7))
ax.set(yscale="log")
sn.violinplot(x='fail_prob', y='trials', hue='success', data=df_N1000, split=True, ax=ax)


Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x1081c70b8>

That certainly looks better, but for now I want to limit my observations to p in [0.1, 1.0] and stay with a linear axis.

Next, let's repeat the step from above and show trials vs probabilities for N = 1000 and p >= 0.1.


In [10]:
df_N1000_large_p = df.query('N == 1000').query('fail_prob >= 0.1')
sn.violinplot(x='fail_prob', y='trials', hue='success', data=df_N1000_large_p, split=True)


Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x1085903c8>

My colleague saw this the other day, and asked whether the distributions are bimodal. So, how many data points are there for each violin-plot?


In [11]:
df.query('N == 1000').groupby('fail_prob').size()


Out[11]:
fail_prob
0.005    36
0.010    36
0.100    36
0.250    36
0.500    36
0.900    36
1.000    36
dtype: int64

36 values for each probability (for N = 1000). That seems enough.

Next, I want to produce another plot. Trials vs Commits to search, parameterised over p.


In [12]:
sn.lmplot(x='N', y='trials', data=df, hue='fail_prob')


Out[12]:
<seaborn.axisgrid.FacetGrid at 0x108759828>

Not useful at all.

  1. I need some values of N between 100 and 1000.
  2. There are too many 'hues', i.e. values of 'fail_prob'. --> In the next plot, I will only pick [0.1, 0.5, 1.0].
  3. That linear regression fit is obviously not what I want.
    • Let's turn it off.
    • Or, let's replace it with a log(N) regression (see the sketch below). Why log(N)? Because that is what binary search gives.
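One way to get such a log(N) fit with seaborn, as a sketch I did not end up using below, is lmplot's logx option, which fits y against log(x) (assuming the option is available in the installed seaborn version):

```python
# Sketch: fit trials against log(N) instead of a straight line in N.
sn.lmplot(x='N', y='trials', hue='fail_prob',
          data=df.query('fail_prob in [0.1, 0.5, 1.0]'),
          logx=True)
```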

Let's thin the plot down a bit:


In [13]:
sn.lmplot(x='N', y='trials', data=df.query('fail_prob in [0.1, 0.5, 1.0]'), hue='fail_prob')


Out[13]:
<seaborn.axisgrid.FacetGrid at 0x108590940>

In [14]:
# Better, but let us try with only 'success'
locs = [int(0.5 * n) for n in set(df['N'])]
sn.lmplot(x='N', y='trials', data=df.query('fail_prob in [0.1, 0.5, 1.0] \
                                              and fail_loc in {} \
                                              and certainty == 0.9 \
                                              and success == True'.format(locs)),
          hue='fail_prob', fit_reg=False)


Out[14]:
<seaborn.axisgrid.FacetGrid at 0x108b945c0>

Plot 3: BBChop vs Git Bisect

How does BBChop fare against Git Bisect if the fault is not intermittent? For that, let's compare the graph for p = 1.0 with Git Bisect's ceil(log2(N)).


In [15]:
# df2: A view of [N, trials]
df2 = df.query('fail_prob == 1.0 and success == True and certainty == 0.7')[['N', 'trials']]
df2.insert(len(df2.columns), 'algorithm', 'BBChop')

# Git Bisect performance.
x = list(range(10, 1000, 100))
from math import ceil, log2
y = list(map(lambda n: ceil(log2(n)), x))
df3 = pd.DataFrame.from_items([('N', x), ('trials', y), ('algorithm', 'Git Bisect')])

sn.lmplot(x='N', y='trials', data=pd.concat([df2, df3]), hue='algorithm', fit_reg=False)


Out[15]:
<seaborn.axisgrid.FacetGrid at 0x108c9e978>

In [16]:
# Is the certainty accurate?

for c in [0.7, 0.9]:
    counts = df.query('certainty == {}'.format(c)).groupby('success').size()
    print('Success rate of algorithm: {}; expected {}.'.format(counts[True] / (counts[True] + counts[False]), c))


Success rate of algorithm: 0.7197530864197531; expected 0.7.
Success rate of algorithm: 0.8666666666666667; expected 0.9.

Discussion

The algorithm works: BBChop locates the fault at roughly the requested certainty, and the number of trials grows with N and with decreasing failure probability, as expected.

Appendix A: Comparing Log vs Linear Likelihoods

Logarithmic Implementation

When running a test provided with BBChop, I noticed that for a large number of detections, Beta(d+1, t+1) * P(L) evaluates to 0 for d, t large (~ 640). This might be related to the poor performance at p < 0.1. [It is not, as my analysis later on shows.]
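The underflow is easy to reproduce with plain floating point (my own check, not BBChop code), since log Beta(a, b) = log Γ(a) + log Γ(b) - log Γ(a+b):

```python
from math import exp, lgamma

d = t = 640
a, b = d + 1, t + 1
log_beta = lgamma(a) + lgamma(b) - lgamma(a + b)
print(log_beta)       # roughly -890, far below the ~ -745 double-precision limit
print(exp(log_beta))  # underflows to 0.0
```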

Developers with practical experience in probabilistic algorithms know that for numerical stability it is often advisable to work with logarithms of probabilities rather than raw probabilities. The developer of BBChop knew this too, which is obvious from his use of logarithms inside the math functions and from his use of the Fraction module to avoid floats.

However, Beta is implemented as

def Beta(a, b):
    return exp(log(Gamma(a)) + log(Gamma(b)) - log(Gamma(a+b)))

It was not obvious to me how to make use of logarithms all the way through the computations. In retrospect, only a small change was required to fix the observed behaviour.

There is the line for li in likelihoods: probs.append(li/likelihoodTot), where likelihoodTot is the sum over terms of the form Beta(., .) * P(.).

One possible solution would be to use the identity log(a + b) = log(a) + log(1 + b/a). Another (http://stats.stackexchange.com/a/66621) is to "subtract the maximum logarithm from all logs."

This was indeed the solution.

\begin{align} P_i &= \frac{L_i}{\sum_i L_i} \end{align}

$\DeclareMathOperator*{\argmax}{argmax}$

Let $\tilde{L}_i$ be the original likelihoods and let $L_i = \tilde{L}_i k$ be the same likelihoods scaled by a constant $k$; the above equation still holds, since $k$ cancels in the ratio. Then $\log L_i = \log(\tilde{L}_i k) = \log \tilde{L}_i + \log k$. Now choose $\log k$ such that $\exp(\log L_i)$ does not underflow. This can be achieved by making the largest value equal to 1: for $i = \argmax_i \log \tilde{L}_i$ we require $\exp(\log L_i) = 1 \Leftrightarrow \log L_i = 0 \Leftrightarrow \log k = -\log \tilde{L}_i$, i.e. we subtract the maximum logarithm from all logs.
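A minimal sketch of this normalisation trick (my own illustration, not BBChop's actual code):

```python
from math import exp

def normalise_log_likelihoods(log_likelihoods):
    # Subtract the maximum log-likelihood so the largest term becomes
    # exp(0) = 1, then exponentiate and normalise; the shift cancels
    # in the ratio P_i = L_i / sum(L_i).
    m = max(log_likelihoods)
    scaled = [exp(ll - m) for ll in log_likelihoods]
    total = sum(scaled)
    return [s / total for s in scaled]

# Values this negative would all underflow to 0.0 if exponentiated directly:
print(normalise_log_likelihoods([-900.0, -901.0, -905.0]))
# -> approximately [0.727, 0.268, 0.005]
```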

I implemented it and the code that erred now passes.

Next, let's evaluate whether this made a difference to the results.

In a later section I show that this did not change the results. However, my test data does not contain such large numbers of observations. Remember, the bug only appears for large counts:

This fails for B(638 + 1, 664 + 1).

where 638 and 664 are the number of faulty and non-faulty (or absolute?) observations at a given location.

As mentioned above, not using log-likelihoods led, at a high number of observations, to an underflow in the exponential.

Does this explain the bad performance at low probabilities?


In [17]:
with open('results_non_logarithm.json', 'r') as f: d = json.load(f)
df_non_log = pd.DataFrame.from_dict(d)
df_non_log.head()


Out[17]:
|   | N    | certainty | fail_loc | fail_prob | seed | success | trials | where |
|---|------|-----------|----------|-----------|------|---------|--------|-------|
| 0 | 1000 | 0.7       | 700      | 0.010     | 1    | False   | 629    | 904   |
| 1 | 1000 | 0.7       | 500      | 1.000     | 2    | True    | 13     | 500   |
| 2 | 10   | 0.7       | 7        | 0.005     | 4    | True    | 1238   | 7     |
| 3 | 100  | 0.9       | 50       | 0.010     | 1    | True    | 1276   | 50    |
| 4 | 100  | 0.9       | 50       | 0.250     | 1    | True    | 47     | 50    |

In [18]:
# Maybe a good comparison between Log vs Linear is for each config plot trials[log] vs trials[lin]? 

# Drop duplicates
df_non_log.drop_duplicates(inplace=True)

merge = pd.merge(df, df_non_log, on=['N', 'certainty', 'fail_loc', 'fail_prob', 'seed', 'where'], how='inner')

# After the merge, trials_x holds the trials from the logarithmic implementation
# and trials_y those from the linear one.
trials_log = merge['trials_x']
trials_lin = merge['trials_y']

sn.lmplot(x='trials_x', y='trials_y', data=merge, fit_reg=False)


Out[18]:
<seaborn.axisgrid.FacetGrid at 0x108c81cc0>

In [19]:
non_equals = merge.query('trials_x != trials_y')
assert(len(non_equals) == 0)

This shows that the behavior is the same before and after fixing the exponential-underflow bug.

Appendix B: An alternative method of computation

Algorithm (a code sketch follows below):

# Run a binary search on the history.
1. Pick the midpoint of the history.
2. Run the test.
3. While the test passes and the number of test runs at the current location is less than $k$, continue running the test at the current location.
4. Mark the current midpoint as either failed or passed according to the result from step 3.
5. Compute the next midpoint and continue from step 1, if there is still a midpoint.
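A minimal sketch of this procedure in Python (my own illustration; test is any flaky test returning True on a pass, as in the Method section):

```python
def naive_chop(test, n_locations, k):
    # Binary search for the first failing location, re-running the flaky
    # test up to k times per midpoint before declaring it "passed".
    lo, hi = 0, n_locations
    while lo < hi:
        mid = (lo + hi) // 2
        failed = any(not test(mid) for _ in range(k))
        if failed:
            hi = mid          # a failure was seen: fault is at mid or earlier
        else:
            lo = mid + 1      # k passes in a row: assume mid is before the fault
    return lo                 # believed first faulty location
```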

This algorithm would have the following characteristics.

A location is misclassified only if it is faulty and all $k$ test runs there happen to pass, which occurs with probability $(1-p)^k$, where $p$ is the fault probability. The certainty of not misclassifying location $i$ is therefore $c_i \geq 1 - (1 - p)^k$. The certainty of not misclassifying any of the $n$ locations visited by the binary search, with $n = \left\lceil \log_2 N \right\rceil$, is $c = \prod_i^n c_i \geq (1 - (1-p)^k)^n$. Treating this bound as an equality, we can rearrange it to find the number of tests $k$ to run at each location: $k = \left\lceil \frac{\log (1 - \exp(\log c / n) )}{\log (1-p)} \right\rceil$ (valid for $p \lt 1$).

The maximum number of trials to undertake becomes:

\begin{align} k n &= \left\lceil \frac{\log (1 - \exp(\log c / n) )}{\log (1-p)} \right\rceil n \\ \end{align}
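As a sanity check on the formula with illustrative numbers (the same p and certainty as used in In [20] below):

```python
from math import ceil, exp, log, log2

N, p, c = 1000, 0.5, 0.9
n = ceil(log2(N))                                  # 10 locations visited
k = ceil(log(1 - exp(log(c) / n)) / log(1 - p))    # 7 test runs per location
print(n, k, k * n)                                 # -> 10 7 70
```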

In [20]:
p = 0.5
certainty = 0.9

condition = 'fail_prob == {p} and success == True and certainty == {cert}'.format(p=p, cert=certainty)
df2 = df.query(condition)[['N', 'trials']]
df2.insert(len(df2.columns), 'algorithm', 'BBChop')


# Naive Chop performance.
x = list(range(10, 1000, 100))
from math import ceil, log2, log, exp
f_n = lambda n: ceil(log2(n))                 # number of locations visited: n = ceil(log2(N))
numerator = lambda n_loc: log(1 - exp(log(certainty) / n_loc))
f = lambda n: ceil(numerator(f_n(n)) / log(1-p)) * f_n(n)   # k * n from the formula above

# Average case
# f = lambda n: ceil(log2(n)) * (ceil(0.5 * 1. / p) + ceil(0.5 * log(1-certainty)/log(1-p)))
y = list(map(f, x))
df3 = pd.DataFrame.from_items([('N', x), ('trials', y), ('algorithm', 'Naive Chop')])

sn.lmplot(x='N', y='trials', data=pd.concat([df2, df3]), hue='algorithm', fit_reg=False)


Out[20]:
<seaborn.axisgrid.FacetGrid at 0x1067f7860>

From this it appears that BBChop needs fewer trials than Naive Chop's worst-case bound. However, this is less clear for the expected number of trials needed.